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Abstract 



Sliced Inverse Regression (SIR) is an effective method for dimension reduction in 
high-dimensional regression problems. The original method, however, requires the 
inversion of the predictors covariance matrix. In case of collinearity between these 
predictors or small sample sizes compared to the dimension, the inversion is not possi- 
ble and a regularization technique has to be used. Our approach is based on a Fisher 
Lecture given by R.D. Cook where it is shown that SIR axes can be interpreted as 
solutions of an inverse regression problem. In this paper, a Gaussian prior distribution 
is introduced on the unknown parameters of the inverse regression problem in order to 
regularize their estimation. We show that some existing SIR regularizations can enter 
our framework, which permits a global understanding of these methods. Three new 
priors are proposed leading to new regularizations of the SIR method. A comparison 
on simulated data is provided. 
. Keywords: Inverse regression, regularization, sufficient dimension reduction. 
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1 Introduction 

Many methods have been developed for inferring the conditional distribution of an univari- 
ate response Y given a predictor X in W, ranging from linear regression p3] to support 



vector regression [12]. When p is large, sufficient dimension reduction aims at replacing 
the predictor X by its projection onto a subspace of smaller dimension without loss of 
information on the conditional distribution of Y given X. In this context, the central sub- 
space, denoted by Sy\x plays an important role. It is defined as the smallest subspace such 
that, conditionally on the projection of X on Sy\Xi Y an d X are independent. In other 
words, the projection of X on Sy\x contains all the information on Y that is available in 
the predictor X. 

The estimation of the central subspace has received considerable attention since the 
past twenty years. Without intending to be exhaustive, we refer to Sliced Inverse Re- 
gression (SIR) [21], sliced average variance estimation [7], and graphical regression [8] 
methods. Among them, SIR seems to be the most popular one. The original method 
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has been adapted to various frameworks and the relative asymptotic properties have been 
derived, see for instance [HI E2J [Ml H3 ES] • We also refer to [3J for a study of the SIR 
finite sample properties, to [151 [27] for the estimation of X = dim(iSyijjf), the dimension 
of the central subspace, and to [H [16] for extension to functional covariates. 

Assuming that K is known and introducing E = Cov(X), in the SIR methodology, 
a basis of the central subspace is obtained by computing the eigenvectors associated to 
the largest K eigenvalues of E _1 Cov(E(A"|Y)). Unfortunately, the classical n— sample 
estimate E of E can be singular, or at least ill-conditioned, in several situations. Indeed, 
since rank(E) < min(n — l,p), if n < p then E is singular. Even when n and p are of the 
same order, E is ill-conditioned, and its inversion introduces numerical instabilities in the 
estimation of the central subspace. Similar phenomena occur when the coordinates of X 
are highly correlated. 

Some regularizations of the SIR method have been proposed to overcome this limita- 
tion. In [6] and [22] , a Principal Component Analysis (PCA) is used as a preprocessing 
step in order to eliminate the directions in which the random vector X is degenerated. 
Thus, for a properly chosen dimension d of the projection subspace, the covariance matrix 
of the projected observations is regular. In the sequel, this technique will be referred to 
as PCA+SIR. Another method consists in adopting a ridge regression technique (see for 
instance [14], Chapter 17) i.e. replaces the sample estimate E by a perturbed version 
E + rip where I p is the p x p identity matrix and r is a positive real number [31 j . Here, 
the idea is that, for r large enough, E + rl p is regular and its condition number increases 
with r. Similarly, in [2SJ E2], regularized discriminant analysis [T7] is adapted to the 
SIR framework. More recently, it is proposed in [23J to interpret SIR as an optimization 
problem and to introduce L\— and L2— penalty terms in the optimized criterion. 

Our approach is based on a Fisher Lecture given by R.D. Cook [10] where it is shown 
that the axes spanning the central subspace can be interpreted as the solutions of an 
inverse regression problem. In this paper, a Gaussian prior is introduced on the unknown 
parameters of the inverse regression problem in order to regularize their estimation. We 
show that the previously mentioned techniques [61 [22], [3T] can enter our framework, which 
permits a global understanding of these methods. Three new priors are proposed leading 
to new regularizations of the SIR method. A comparison with the L<i— regularization 
introduced in [23] is also provided. It is shown that, from the theoretical point of view, 
the proposed Li— regularization cannot be justified. 

This paper is organized as follows. In Section an adaptation of the inverse regres- 
sion model to our framework is presented. Section [3J is dedicated to the regularization 
aspects. Theoretical comparisons with existing approaches as well as new methods are 
provided. Finite sample properties are illustrated in Section [H Proofs are postponed to 
the Appendix. 
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2 Inverse regression without regularization 



Consider X a W— random vector, Y the real response variable and let us denote by S Y \x 
the central subspace. In the following, for the sake of simplicity, we assume that K = 
dim(SV|x) = 1) the case 1 < K < p being discussed in Section[3l We thus introduce (3 G W 
such that Sy\x = span(/3). In Subsection 12.11 the considered inverse regression model is 
presented. The estimation of the unknown parameters is discussed in Subsection 12.21 and 
the links with the SIR method are established in Subsection [ 

2.1 Single-index inverse regression model 

The following inverse single-index regression model is considered (see [10] ; equation (2) 
for the multi- index model): 

X = fi + c(Y)Vb + e, (1) 

where /i and b are non-random R p — vectors, e is a centered W p — Gaussian random vector, 
independent of Y, with covariance matrix Cov(e) = V and c : R — ► R is a nonrandom 
function. Under this model, 

E(X\Y = y) = fi + c(y)Vb, 

and thus, after translation by [a, the conditional expectation of X given Y is a degenerated 
random vector located in the direction Vb. Prom [10], Proposition 1, b corresponds to the 
direction f3 of the central subspace. In the sequel, it will appear that, under appropriate 
conditions, the maximum likelihood estimator of b is (up to a scale parameter) the SIR 
estimator of (3. Moreover, note that, under ([1]), one has 

E(b t (X- i ,)\Y = y) 

c(y) = Wl • (2) 

Now, restricting ourselves to the single-index case, the forward model of SIR asserts that 
there exists a univariate link function g such that E(Y"|X) = g(b l X) or equivalently, 
b l X = 5 - 1 (E(y|X)). Thus, replacing in © yields 

c(y) = WVb ' 

i.e. the coordinate function is, up to an affine function, the inverse of the link function in 
the single-index forward model of SIR. 

2.2 Maximum likelihood estimation 

We now address the estimation of the coordinate function c(.), the direction b, the co- 
variance matrix V and the location parameter fi in model ([!]). To this end, we focus on 
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a projection estimator of the unknown function c(.). More precisely, it is expanded as a 
linear combination of h basis functions Sj(.), j = 1, . . . , h: 

h 

c(-) =2^c J -s i (.), (3) 
i=i 

where the coefficients c,-, j = 1, . . . , h are unknown whereas h is supposed to be known. 
Introducing c = (ci, . . . , c^) and s(.) = (si(.)j . . . ,s/ l (.))*, model ([1]) can be rewritten as 

X = fl + s \Y)cVb + e, e~M(0,V), (4) 

where A/"(0, V) is the multivariate centered Gaussian distribution with covariance ma- 
trix V . In the sequel we denote by 

_ - b l Vb 

the signal to noise ratio in the direction b. Let (Xi,Yi), i = l,...,n be a sample of 
independent random variables distributed as (X, Y). Clearly, estimating (fi,V,b,c) by 
maximization of the likelihood in model (HJ) consists in minimizing 

1 n 

G{n, V, b, c) = log det V + - Yifi + s\Yi)cVb - XifV' 1 ^ + s\Yi)cVb - X { ), (5) 

i=i 

with respect to (//, V, b, c). Note that G{ii, V, b, c) can also be interpreted as a discrep- 
ancy functional, see equation (5) in [llj. Up to our knowledge, the introduction of such 
functional in the inverse regression framework is due to [9]. Let us introduce the h x h 
empirical covariance matrix W of s(Y) defined by 

n 

W = -J2(4Yi)-s)(s(y i )-s) t , 

i=l 

the h x p matrix M defined by 

1 n 

M = -Y,(s(Y l )-s){X,-X)\ 
i=i 

and S the empirical p x p covariance matrix of X 

1 n 

Z = -J2{X i -X)(X i -X)\ 

where 

n 1 n 

I=-Vl, and s = -Vs(yi). 
n ^— ' n ^— ' 

i=i i=i 
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Lemma 1 Using the above notations, G(ji, V, b, c) can be rewritten as 

G{n,V,b,c) = \ogdetV + tr{tV- 1 ) + {^-X + s t cVb) t V- 1 {ii-X + s t cVb) 
+ (c t Wc)(b t Vb)-2c t Mb. 

The maximum likelihood estimators of //, V, b and c are closed-form. 

Proposition 1 Under ifW and S are regular, then the maximum likelihood estimator 
of (n, V, b, c) is defined by: 

• b is the eigenvector associated to the largest eigenvalue A of S _1 M t W~ 1 M , 

• 6= -J^W^Mb, 

b l Vb 

• p, = X - s^Vb, 

• V = S - J^rtbtit. 

As a consequence of the above equation, one also has A = 1 — b Vb/l rS6, which provides 
an estimation of the signal to noise ratio in the direction b through 

P = A/(1-A). (6) 

Let us now show that the SIR method corresponds to the particular case of piecewise 
constant basis functions Sj(.), j = 1, . . . , h. 

2.3 Sliced Inverse Regression (SIR) 

Suppose the range of Y is partitioned into h+1 non-overlapping slices Sj, j = 1, . . . , h + 1 
and consider the h basis functions defined by 

8j (.)=l{.eSj}, j = l,...,h (7) 

where I is the indicator function. Let us denote by nj the number of Y$ in slice j = 
1, . . . , h + 1, define the corresponding proportion by fj = nj/n, the empirical mean of X 
given Y £ Sj by 

Xj = — y Xi 

and let T be the p x p empirical "between slices" covariance matrix defined by 

h+1 

r = Y,f^-x){x j -xf. 

In this context, the following consequence of Proposition [T] can be established. 
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Corollary 1 Under ^ and ([?[), j/E is regular, then the maximum likelihood estimator 
b of b is the eigenvector associated to the largest eigenvalue o/E —1 r. 

To summarize, the maximum likelihood estimator of b is (up to a scale factor) the classical 
SIR estimator of the direction j3 spanning the central subspace. The next section is 
dedicated to the introduction of a regularization in the inverse regression problem in order 
to avoid the inversion of S. 

3 Regularized inverse regression 

First, we present in Subsection 13.11 how the introduction of a prior information on the 
unknown direction b can overcome the SIR limitations due to the ill-conditioning or sin- 
gularity of E. Second, some links with the existing SIR regularizations are highlighted in 
Subsection 13.21 Finally, basing on our framework, three new regularizations of the SIR 
method are introduced in Subsection [ 



3.1 Introducing a Gaussian prior 

We propose to introduce a prior information on the projection of X on b appearing in the 
inverse regression model. More precisely, we focus on 

E(6 t (X - n)\Y)b 



bWb 

from ([2|) and ([3]) , assuming that 



c(Y)b = s\Y)cb, 



(1 + p)- 1/2 (s(Y) - s) t c6~JV(0,n). (8) 

The role of the matrix Q is to describe which directions in W are the most likely to 
contain b. Some examples are provided in the next two paragraphs. The scalar (1 + p) -1 / 2 
is introduced for normalization purposes, permitting to preserve the interpretation of the 
eigenvalue in terms of signal to noise ratio. As a comparison, in [2], a Bayesian estimation 
method is proposed basing on B-splines approximation of the link function g in the forward 
model and a Fisher-von Mises prior on the direction b. In our approach, working on the 
inverse regression model allows to obtain explicit solutions, see Lemma[2]and Proposition [2] 
below. 

Lemma 2 Maximum likelihood estimators are obtained by minimizing 

Gn (^M = (^M + (t '"~'^ >W < 9 > 

with respect to (fi,V,b,c). 

Comparing to ©, the additional term due to the prior information can be read as a 
regularization term in Tikhonov theory, see for instance [30j, Chapter 1, penalizing large 
projections. The following result can be stated: 
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Proposition 2 Under and if W and Q,Y, + I p are regular, then the maximum 
likelihood estimator of (//, V, b, c) is defined by: 

• b is the eigenvector associated to the largest eigenvalue A of (ttYi + Ip)~ 1 QM t W~ 1 M, 

1 ~ - 5*Jl _1 fe 

• c = - — 7^-^rW~ x Mb, where ri(b) = „ ^ ^ , 

(1 + ?7(6))6*y& VXb 

• p, = X - s l cVb, 

• V = t - J^rtWt. 

b l zb 

Compared to Proposition [TJ the inversion of S is replaced by the inversion of fiE + I p . 
Thus, for a properly chosen prior matrix Q, the numerical instabilities in the estimation of 
b disappear. Note that, since the estimation of V is formally the same as in Proposition [H 
the interpretation ([6]) of A still holds. As previously, this result can be applied to the 
particular case of the SIR method. 

Corollary 2 Under Q), ([?|) and (0), + I p is regular, then the maximum likelihood 

estimator of b is the eigenvector associated to the largest eigenvalue of + Ip)^ 1 ^}!". 

In the following, the above estimator of the direction b will be referred to as the Gaus- 
sian Regularized Sliced Inverse Regression (GRSIR) estimator. Let us emphasize that the 
GRSIR estimator can be extended to the multi-index situation by considering the eigenvec- 
tors bi, . . . , bx associated to the K largest eigenvalues Ai > • • • > \k of (OS + / p ) _1 iir. 
For instance, one can show that 62 maximizes Gq under the orthogonality constraint 
6 t 1 (S + O- 1 )6 2 = 0. 

Some examples of possible prior covariance matrices f2 are now presented. 

3.2 Links with existing methods 

In all the next examples, a non- negative parameter r is introduced in the prior covariance 
matrix in order to tune the importance of the penalty term in ([9]). Consequently, in the 
sequel, r is called a regularization parameter. 

Classical SIR approach. It is easily seen from Corollary [2] that choosing the prior 
covariance matrix Qq = (rS) _1 in GRSIR gives back SIR, and this for all r > 0. This 
prior matrix indicates that directions corresponding to small variances are most likely, i. e 
the SIR method favors directions in which S is close to singularity. In practice, this choice 
yields instabilities in the estimation. 
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Ridge approach [ 28[ 1291 131) . The simplest choice for the prior covariance matrix 
is fii = T -1 /p. In this case, the identity matrix indicates that no privileged direction 
for b is available. Following Corollary [21 the corresponding GRSIR estimator of b is the 
eigenvector of (E-\-tI p )~ 1 T associated to its largest eigenvalue, which is the ridge estimator 
introduced independently in [5T] and [28\ 129]. 

PCA+SIR approach [6, 22J. As already seen, a popular technique to overcome the 
singularity problems of S is to use PC A as a preprocessing step O [22]. The princi- 
ple is the following. Let d £ {l,...,p} be fixed and denote by Ai > • • • > A^ the d 
largest eigenvalues of S (supposed to be positive), qi, . . . , the associated eigenvectors 
and Sd = span(qi, . . . , qj) the linear subspace spanned by q\, . . . , q^. The first step consists 
in projecting the predictors on Sd- The second step is to perform SIR in this subspace. The 
next result shows that this method corresponds to a particular prior covariance matrix. 

Proposition 3 PCA+SIR corresponds to GRSIR with prior covariance matrix 



where r > is arbitrary. 

Let us note that although depends on r, the GRSIR estimator does not, since there is 
no regularization parameter in the PCA+SIR methodology. 

Li and Yin's approach |23j. Their proposed L2— regularization consists in estimating 
(6, c) by minimization of 

h 



the matrix N being either N = I p or N = T, . In our opinion, this approach suffers from 
a lack of invariance since the functional H T does not penalize the same way two different 
axes (b and 2b for instance) defining the same direction. As a consequence, we have shown 
([5], Proposition 1) that the only possible solution b of the minimization problem is b = 0: 
In view of this result, the proposed alternating least squares algorithm ([23], Section 2) 
cannot be justified theoretically. As a comparison, our method does not yield this kind 
of problem thanks to the invariance property: Gq(/j,, V, tb, c/t) = Gq(//, V, b, c) for all real 
numbers ( ^ 0. We now propose some alternative choices of the covariance matrix £1 
yielding new regularizations of the SIR method. 

3.3 Three new SIR regularizations 

Tikhonov regularization. An alternative choice of the prior covariance matrix is ^3 = 
r _1 S. Comparing ^3 to the matrix S7q associated to the SIR method, it appears that 




H T (b, c) = Y^ fjiv + ^ c i h ~ x jY N (v + Zcjb ~ Xj) + rb l b 
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the underlying ideas are opposite. Here, directions corresponding to large variances are 
most likely. The associated GRSIR estimator of the direction b is the eigenvector of 
(E 2 + r/ p )~ 1 Sr associated to its largest eigenvalue. In the following, this estimator will 
be referred to as the Tikhonov estimator. Indeed, let us recall that the classical SIR 
estimator is obtained by a spectral decomposition of S _1 r. For all k = 1, . . . ,p, denote by 
Xj. the k— th column of this matrix. Computing is equivalent to solving with respect to 
x the linear system = 1^ where 1^ is the k— th column of T. The associated Tikhonov 
minimization problem (see (1-34) in [30]) can be written as 

Xk = argmin II Ex — rdl 2 + tI|x|| 2 = (£ 2 + Tl p )~ 1 Y,Tk- 

X 

Thus, in this framework, (S _1 r)fc is estimated by (X 2 +t/ p ) _1 XT^. and consequently £ _1 T 
is estimated by (S 2 + r/ p )" 1 Sf . 

Dimension reduction approaches. It has been seen in Proposition^ that the PCA+SIR 
approach is equivalent to using the prior covariance matrix hi GRSIR. The following 
result is an extension to more general covariance matrices. 

Proposition 4 For all real function ip let 

d 

i=i 

Then, the associated GRSIR estimator can be obtained by first projecting the predictors 
on Sd = span(qi, . . . ,qd) and second performing GRSIR on the projected predictors with 
prior covariance matrix 

p 

i=i 

The dimension d plays the role of a "cut-off' parameter, since when computing b, all 
directions qd+i, ■ ■ ■ ,q P are discarded. Three illustrations of this result can be given: 

• Choosing <p(t) = l/(rt), we obtain ft(l/(Yld)) = 2 and f2(l/(rld)) = (rS)" 1 = O , 
where Id is the identity function. It appears that Proposition [3] is a particular case 
of Proposition HJ As already discussed, since the choice of as a prior covariance 
matrix seems not very natural, we thus propose two new choices. 

• First, ip(t) = 1/r yields 

d 

T i=i 

and 0(1/t) = I p /t = fii. Consequently, this new method consists in applying the 
ridge approach [31] on the projected predictors, the interpretation being that, in the 
subspace Sd, all directions share the same prior probability. This method will be 
referred to as PCA+ridge. 
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• Second, (p(t) = tjr yields 

1 d 

n 5 = / 0(Id/r) = -^A^, 

and 0(ld/r) = E/r = ^3. This new method consists in applying Tikhonov approach 
on the projected predictors. In this context, directions of carrying a large fraction 
of the total variance of X are more likely. This method will be referred to as 
PCA+Tikhonov. 



4 Numerical experiments 

GRSIR methods associated to the prior covariance matrices f?o (SIR), fii (ridge), 
(PCA+SIR), ft 3 (Tikhonov), ft 4 (PCA+ridge) and tt 5 (PCA+Tikhonov) are compared 
on simulated data. A sample of size n = 100 of the random pair (X, Y) is considered, 
where X € MP with p = 50 and The random vector X is Gaussian, centered, with 

covariance matrix E = Q/S.Q 1 where A is the diagonal matrix containing the eigenvalues 
of E defined by A =diag{p 9 , (p — I) 6 ', . . . , l e } and Q is a randomly chosen orthogonal 
matrix. Several values of 9 will be considered. Note that the condition number of E is 
given by p e and is thus an increasing function of 9. As suggested by [19] . the orthogonal 
matrix Q is obtained as follow: First, we construct a p x p matrix where each coefficients 
are randomly chosen from a Gaussian distribution. Next, the matrix Q is obtained by 
applying the QR-decomposition on this matrix. We consider two SIR models: 

Model 1 Y = sin (^-0**) + 



Model 2 Y 



fix 

1/2 



a 

where a is the standard deviation of the projection of X on (3 i.e. a = (/3*E/3) 1//2 , e is 
a centered Gaussian random value with standard deviation 0.03 independent of X. The 
true index is j3 = 5 -1 / 2 Q(l, 1, 1,1,1,0,... , 0)*. Hence in the direction of /3, the variance 
of X is large. 

In all the experiments described below, we replicate = 100 times Models 1 and 2. 
More precisely, for Model 1, we compute the pairs (X,yW), r= 1,. . . ,N where 

yM = S in {^X) + e r , e r ~ M(0, (0.03) 2 ), 

with e r independent of X. The same is done for Model 2. We then obtained N estimators 
W\ r = 1, . . . , N of p. In order to evaluate the quality of the estimate b, we compute two 
criteria. The first one is the mean of the squared cosine (MSC): 

1 N 

MSC = -^(/3*6W) 2 . 
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The second one is a kind of variance of the squared cosine (VSC): 



1 



JV 



VSC 



N{N - 1) 



EE((^ (r) ) 2 - 



r=l r^s 



These criteria can be adapted to multi-index models [15J. The closer these quantities are 
to 1, the better the estimation is. 

First experiment: We illustrate the dependence of the above methods with respect to 
the regularization parameter r. In this aim, we compute the two previous defined criteria 
as function of the regularization parameter. A logarithmic scale was adopted, 150 values 
of log(r) regularly distributed in [—5, 25] were considered. Here, we limit ourselves to 
8 = 2. Moreover, we choose d = 20 in the PCA+SIR, PCA+ridge and PCA+Tikhonov 
methods. Results are displayed on Figure [Q and Figure [2] for Model 1 and on Figure [3] and 
Figure H] for Model 2. It appears that the classical SIR approach gives very poor results 
in such a situation where n and p are of the same order (n/p = 2). Ridge and Tikhonov 
regularizations can bring a significant improvement provided r is large enough. The VSC 
criterion of the ridge and the Tikhonov methods is better than the one of the classical 
SIR method if the regularization parameter is taken sufficiently large. PCA+SIR obtains 
reasonable results compared to SIR, with the advantage of do not requiring the selection 
of r. The selection of d is addressed in our third experiment. Note that PCA+ridge and 
PCA+Tikhonov methods are less sensitive to the choice of r than ridge and Tikhonov 
methods. Concerning the criterion VSC, PCA+ridge and PCA+Tikhonov methods both 
outperform the PCA+SIR for sufficiently large values of the regularization parameter. For 
one of the N = 100 simulated datasets, the pairs (/3*Aj, bXi), i = 1, . . . , n are represented 
on Figure [5] for Model 1 and on Figure [6] for Model 2. In order to make a comparison, 
the estimated axis b is computed with the classical SIR method and the PCA+Tikhonov 
method. It appears clearly that the SIR method leads to very bad oriented estimator of 
{3 which is not the case with the PCA+Tikhonov method. 

Second experiment: The robustness with respect to the condition number is inves- 
tigated by varying 8 in {0,0.1,0.2,... ,3}. For each value within this set, the optimal 
regularization parameter has been selected for each method and the corresponding MSC 
criterion is displayed on Figure [7] and Figure [HJ This is done only for Model 1. Clearly, 
the classical SIR method is very sensitive to the ill-conditioning of the covariance ma- 
trix. For all the other considered methods, results are getting better while the condition 
number increases. Note that ridge and Tikhonov methods as well as PCA+ridge and 
PCA+Tikhonov yield very similar results. 

Third experiment: Illustration of the role of d in PCA+SIR, PCA+ridge and PCA+Tikhonov 
methods. Here, the condition number is fixed by choosing 8 = 2. For each value of d in 
{0, 1, . . . ,p}, the optimal regularization parameter has been selected for each method and 
the corresponding MSC criterion is displayed on Figure [9j Only the Model 1 is considered 
here. One can see that the PCA+SIR method is very sensitive to d. Indeed, if d is large, 
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then this approach reduces to SIR, whose accuracy is low for large dimensions. At the 
opposite, PCA+ridge and PCA+Tikhonov results remain stable as d increases, since these 
methods get close to ridge and Tikhonov methods respectively. 

5 Retrieval of Mars surface physical properties from hyper- 
spectral images 

We propose here to apply GRSIR in the context of a nonlinear inverse problem in remote 
sensing. Hyperspectral remote sensing is a promising space technology regularly selected 
by agencies with regard to the exploration and observation of planets. It allows to col- 
lect for each pixel of a scene, the intensity of light energy reflected from materials as it 
varies across different wavelengths. Hundreds spectels in the visible and near infra-red 
are recorded, making it possible to observe a continuous spectrum for each image cell. 
The analyze of these spectral signatures allows to identify the physical, chemical or min- 
eralogical properties of the surface and of the atmosphere that may help to understand 
the geological and climatological history of planets. 

Our goal is to evaluate the physical properties of surface materials on Mars planet 
from hyperspectral images collected by the OMEGA instrument aboard Mars express 
spacecraft. The used approach is based on the estimation of the functional relationship 
G between some physical parameters Y and observed spectra X. For this purpose, a 
database of synthetic spectra is generated by a physical radiative transfer model and used 
to estimate G. The high dimension of spectra (p = 184 wavelengths) is reduced by using 
GRSIR. Results are compared with the traditional SIR approach. 

5.1 Data 

In this application, we focus on an observation of the south pole of Mars at the end 
of summer. It has been collected during orbit 61 by the French imaging spectrometer 
OMEGA on board Mars Express Mission. A detailed analysis of this image [13] revealed 
that this portion of Mars mainly contains water ice, carbon dioxide and dust. This has 
led to the physical modeling of individual spectra with a surface reflectance model. This 
model allows the generation of 12.000 synthetic spectra with the corresponding parameters 
that constitute a learning database. Here, we focus on the terrain unit of strong CO2 
concentration determined by a classification method based on wavelets [26J. It contains 
approximately 9,000 spectra to reverse. The 5 most important parameters characterizing 
the morphology of these spectra are the proportions of CO2 ice, water ice and dust; and 
the grain sizes of water ice and CO2 ice. In the sequel only one parameter, the grain size 
of CO2 ice is presented. A detailed analysis for all other parameters can be found in |4j. 
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5.2 Methodology 

Two methods are compared in order to reverse the hyperspectral image: SIR and GRSIR 
with the PCA + ridge approach. Both methods allow to choose a lower dimensional 
regressor space, sufficient enough to describe the relationship of interest. In practice, it 
appears that a unidimensional regressor space gives satisfactory results [3J. Thus, we 
assume that there exists a function g from R to R such that: G(X) = g(< j3,X >). As a 
consequence, the estimation of the relationship G reduces to computing one direction f3, 
the univariate function g being estimated by piecewise linear regression. In the PCA+ridge 
approach, the cut-off dimension is fixed to d = 20. The regularization parameter, fixed to 
r = 10 -4 ' 2 , is chosen to minimize the mean squared error when estimating the grain size 
of CO2 ice of the learning database itself. An example of the relationship between reduced 
spectra and the grain size of CO2 estimated using GRSIR is presented on Figure [TUJ It 
shows that the relationship is nonlinear and that one direction seems to be sufficient to 
estimate the grain size of CO2. 

5.3 Results 

With GRSIR methodology, the inversion of the image from orbit 61 shows a smooth 
mapping of the grain size of CO2 (see Figure fT2j) making it possible to distinguish some 
areas with great sizes of CO2 ice on the boundaries and some areas with small values 
inside the cap. On the opposite, with the traditional SIR approach, estimates assume 
a small number of values that seem to be distributed randomly and correspond to the 
minimum and maximum values of the parameter in the learning database (see Figure [TTj) . 
These poor results can be explained by the very high condition number, about 10 14 , of 
the empirical covariance matrix. Since no ground data is available, it is quite difficult to 
quantify the accuracy of GRSIR estimates. However, comparisons with other approaches 
or with estimates from other hyperspectral images from the same portion of Mars [4J give 
consistent results. GRSIR approach then appears promising for model inversion in remote 
sensing. 

6 Concluding remarks 

A new framework has been presented to regularize the SIR method. It provides new 
interpretations of the existing methods as well as the construction of new regularization 
techniques. Among them, it appears that the PCA+ridge and PCA+Tikhonov methods 
are interesting alternatives to the existing PCA+SIR [6j [22] and ridge [31] methods. The 
use of a " cut-off' dimension d permits to limit the sensitivity to the choice of the regular- 
ization parameter r. The choice of this dimension itself seems not crucial since for large 
value the above methods are close to the ridge and Tikhonov techniques. In our exper- 
iments, the choice d = p/2 appears as a good heuristics in most situations. Of course, 
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an automatic selection of (r, d) would be of interest. To this end, the construction of a 
generalized cross-validation criterion is under investigation. We also plan to study the 
introduction on non-Gaussian priors in order to obtain L\ — penalizations and thus sparse 
estimates of (3. 
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Appendix: Proofs 

Proof of Lemma Q3 Let us remark that 



1 

A := G(u, V, b, c) - logdet V = - > ZjV~ l Zi 

n. — ' 



(10) 



i=l 



where we have defined for i = 1, . . . , n 




t cVb - {Xi - X) 



Since Z2 and Z% are centered, replacing the previous expansion in (flOl) yields 



^ it X 1 2 

A = - J2 z Uv~ lz i,i + - E z ii y ~ lz ^ + - E z i^ z ^ - - E z i^ z ^ 



i=l i=l i=l i=l 
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2 n 



n . 



where 
1 n 

ZLV^Zu = (u- X + s t cVb) t V~ 1 (n- X + s t cVb) 

n — ' 

i=l 
1 - 

-Y j Z t 2i V- l Z 2}l = (c'Wc^Vb) 

±J2 z li v ~ lz 3,i = -Y J ^ x l -x) t v-\x l -x)) = -Y J ^{v- l (x t -x){x t -x) t ) 

i=i i=i i=i 

= tr(y -1 E) 

^Z*^- 1 ^ = 2c l Mb, 

i=l 

and the conclusion follows. ■ 

Proof of Proposition [TJ. Annulling the gradients of G(/jl, V, b, c) yields the system of 
equations 

Iv^G = V- 1 (fi-X + s t cVb)=0, (11) 
Iv 6 G = ((c t sf + c t Wc)Vb-M t c + (c t s)(fi-X) = 0, (12) 
Iv c G = (&*t>S)(ss* + W)c-M6 + (/t-X)% = 0, (13) 
V V G = V- 1 + bti {{c l s) 2 + c*Wc) - V- 1 (<ji -X)(fi- Xf + S) V^ 1 = 0. (14) 

From (llip . we have 

A = X - s*cT>S. (15) 
Replacing in (fT2|) and (fT3j) yields the simplified system of equations 

{^WcjVb = M*c, (16) 
{tfVbjWc = Mb. (17) 

Assuming W is regular, equation (fT7|) entails 

c = W^Mb/ttVb 

and replacing in (fl~6|) yields 

(c*Wc)(S*fS)F6 = Af^W^AfS. (18) 
Now, multiplying (I14p on the left and on the right by V and taking account of (|15p entail 

£ = V + (^WcjVbPV. (19) 
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As a consequence of ([19]) . we have 

±b=(l + (c*Wc)(6*yS)) t>6, (20) 
which means that Eb is proportional to Vb. Consequently, one also has 

Vb = 9(b)£b, (21) 

where we have defined 

«$> = 22. 

Substituting (|2T |) in ([15]) yields 

(c t Wc)(b t Vb)9(b)tb = M l W~ x Mb 
and thus b is an eigenvector of T,~ 1 M t W~ 1 M. Let us denote by A the associated eigenvalue 

A = (c*W r c)(S*t>S)0(6). (22) 

Collecting ((2Qj) and flU) yields 

_L = i + (c*wc) (S*^S), 

9(b) 

and thus the eigenvalue can be rewritten as 

A = 1 - 0(b). 

Moreover, we have, from (116ft . 

&Mb = X/9(b), (23) 
tiitV^ 1 ) = p + X/9(b), (24) 
log det V = log det E - log det f/ p + (c* Wc)y66* N 

= logdetE-log(l + A/0(6)V (25) 
entailing 

G(ji,V,b,c) = p + logdet£-\og(l + \/0(b) 

= p + log det E + log(l - A). 

As a consequence, to minimize G, A should be the largest eigenvalue. Finally, let us 
consider 

V = E - J^tbtft, 
ft'Eb 



leading to 

Vb + (^WcjVbSS'Vb = E + ( (c*Wc)0 2 (6) - ] tWt 

6*E6 



E + i^S ((c'V7c)(&'Vb6>(6) - A) Eta*E 
E, 



in view of (|22p and thus Vb verifies equation ([19 
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Proof of Corollary [TJ Let us remark that, under ([7|), the coefficients of Wij of W are 
given by Wj,- = fil{i = j} — fifj for all G {1, . . . , /i} 2 . The inverse matrix of W is 

W" 1 = diag (1 . . , +-!-£/, 

where U is the h x h matrix defined by Uij = 1 for all G {1, . . . , h} x {1, . . . , /i}. 

Since the j'th line of M is given by fj(Xj — X) f for all j = 1, . . . , h and taking account of 
U 2 = fh+iU, we have 

1 



M l W~ x M = V /^(X,- - X)(X,- - X)* + M l UM 

h 1 



Now, remarking that all the columns of M U are equal to 
h h 

i=i j=i 

= -fh+i(X h+1 -X), 

it follows that 

(M^)(M*C/)' = hf 2 h+1 {X h+1 - X)(X h+1 - X)* 
and thus replacing in ([26]) yields 



h+1 

M l W~ x M = ^ fj( X J ~ x )( x j ~ X Y 

3=1 

= f. 



The result is then a consequence of Proposition [TJ 



Proof of Lemma [2l The joint distribution of (X, b) given Y denoted by p(X,b\Y) is 
calculated as the product p(X\Y,b)p(b\Y) where p(X\Y,b) is given by © and p(b\Y) is 
given by ([8]). The estimators are obtained by minimizing 



Jn(/i,F,6,c) = --J2logp(X i ,b\Y l ) = --y j logp(X i \Y i ,b)--y^\ogp(b\Y i ) 

b l Vt- l b 1 



n * — ' n £ — ' n 

i=l i=l i=l 



1 + p n 

i=i 

G( M , V, 6, c) + _(&*n-i&)(^Wc) + c ste , 

ste 



which is Gq(/U, V, 6, c) up to the constant C . 
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Proof of Proposition [2J, The proof is similar to the one of Proposition [TJ First, remark 
that equation still holds and thus fi = X—s l cVb. Let us recall the following definitions 

0(S) = !t! and „(S) = ££* 
Annulling the gradients of Gq((j,, V, 6, c) yields the system of equations 

(#Wc) (nVb + 0(S)S + r)(b)n(Vb - 6»(S)E6)) = JW*c, (27) 

(b t Vb)(l + 'q(b))Wc = Mb, (28) 
F -1 + bb t (c t Wc)(l + 77(6)) = V^tV' 1 . (29) 

Assuming W is regular, equation ([25]) entails 

c = w _1 mS/((i + 77(S))(S*fS)) 

and replacing in ([271) yields 

(#Wc) (S*vS)(i + r?(S)) (nvS + 0(5)6 + ^(6)^(^6 - e{h)tb)^ = v.m'w \ii>. (30) 

Now, multiplying (|29p on the left and on the right by V, it follows that 

E = V + (#Wc)(l + ri(b))vWv, 

leading to 

E6 = (l + (&Wc)(b t Vb)(l + »7(S))) Ffe, (31) 
which means that Efe is proportional to As a consequence, one also has 

Vb = 9(b)£b, (32) 

and replacing in ([30j) yields 

(c^c) (6*yS)(l + r)(b))9(b) (nt + Ip\b= ^lM t W~ 1 Mb, 

and thus 6 is an eigenvector of + I p ) Q,M t W M . Let us denote by A the associated 
eigenvalue 

A = (^W r 6)(S*V r S)(l + v(b))0(b). 
Collecting (M]) and (32J) yields 

^ = l + (c t ^c)(6^6)(l + r ? (6)) 

and consequently the eigenvalue can be rewritten as 

A = 1 - 0(b). 

Now, let us remark that (|23|) . (|24p and (|25 [) still hold in this context entailing 

G a (Ji,V,b,c) =p + logdetE + log(l - A). 

As a consequence, to minimize Gq, A should be the largest eigenvalue. The end of the 
proof follows the same lines as the one of Proposition [TJ ■ 
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Proof of Proposition [3j Let P be the projection matrix on Sd- 

d 

and, for alH = 1, . . . , n consider the projected predictor defined by X{ = PX{. Introducing 
r = PTP the empirical " between slices" matrix associated to X\ , . . . , X n and £ = PEP 
the corresponding covariance matrix, the PCA+SIR method finds b such that 

f 6 = \tb, 

where A € R, or equivalently, 

Pf P6 = APSP6. 
Remarking that PEP = SP, we have, for all r > 0, 

PfP6 = — - — (PS + tS)P6, 
1 + r 

and defining 6 = P6 and A = A/(l + r), it follows that 

Pf 6 = A(PS + rS)6. 

Since P = rEf^i we thus have 

tn 2 tb = \(tn 2 t + t)b, 

which means that b is an eigenvector of (f^E + /p) -1 ^^ '. Corollary [2] concludes the proof, 
i.e. 6 is the GRSIR estimator with prior covariance matrix Q2- ■ 

Proof of Proposition [4l Here, we adopt the notations introduced in the proof of 
Proposition [3j The GRSIR estimator b computed on the projected predictors X%, . . . ,X n 
verifies 

n((p)fb = A(J%)£ + I p )b, 

or equivalently 

Q(tp)PfPb = \(tt(ip)P±P + I p )b. 
Multiplying this equation by P on the left, we obtain 

Pn{ip)PfP~b = \(Ptt(cp)PEP + P)b. 

Since PVt{ip)P = f2(<£>), and introducing b = Pb, it follows that 

Sl(<p)t b = X(Q((p)t + I p )b, 

which means that b is an eigenvector of (f2(c/?)£ + I p )~ 1 ft(<fi)r. Corollary [2] concludes the 
proof, i.e. b is the GRSIR estimator with prior covariance matrix Q(ip). ■ 
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(a) Criterion MSC (b) Criterion VSC 



Figure 1: Values of MSC and SSC with respect to the regularization parameter for Model 
1. The condition number is fixed to 9 = 2. Horizontally: log(r), vertically: (a) MSC and 
(b) VSC. Continuous line: (SIR), "-o-" line: Oi (ridge), dashed line: (Tikhonov). 




(a) Criterion MSC (b) Criterion VSC 



Figure 2: Values of MSC and VSC with respect to the regularization parameter for Model 
1. The cut-off dimension is chosen to d = 20 and the condition number is fixed to 6 = 2. 
Horizontally: log(r), vertically: (a) MSC and (b) VSC. Continuous line: tl 2 (PCA+SIR), 
"-o-" line: ft 4 (PCA+ridge), dashed line: ft 5 (PCA+Tikhonov). 
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Figure 3: Values of MSC and VSC with respect to the regularization parameter for Model 
2. The condition number is fixed to 9 = 2. Horizontally: log(r), vertically: (a) MSC and 
(b) VSC. Continuous line: (SIR), "-o-" line: Oi (ridge), dashed line: (Tikhonov). 




logW logft) 
(a) Criterion MSC (b) Criterion VSC 



Figure 4: Values of MSC and VSC with respect to the regularization parameter for Model 
2. The cut-off dimension is chosen to d = 20 and the condition number is fixed to 6 = 2. 
Horizontally: log(r), vertically: (a) MSC and (b) VSC. Continuous line: tl 2 (PCA+SIR), 
"-o-" line: ft 4 (PCA+ridge), dashed line: 5 (PCA+Tikhonov). 
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Figure 5: An example of pairs (/3 Xj, b Xi), i = 1, . . . ,n for Model 1 where the axis b is 
computed with the classical SIR method (points "o") and the PCA+Tikhonov method 
(points "x"). Here, 9 = 2 and d = 20. 




Figure 6: An example of pairs (ftX^tfXi), % = 1, . .. ,n for Model 2 where the axis b is 
computed with the classical SIR method (points "o") and the PCA+Tikhonov method 
(points "x"). Here, 9 = 2 and d = 20. 
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Figure 7: Sensibility of GRSIR with respect to the condition number of the covariance 
matrix for Model 1. Here, we use the optimal r. Horizontally: 9, vertically: the criterion 
MSC. Continuous line: CIq (SIR), "-o-" line: £1% (ridge), dashed line: SI3 (Tikhonov). 




5 2 2.5 3 



Figure 8: Sensibility of GRSIR with respect to the condition number of the covariance 
matrix for Model 1. The cut-off dimension is chosen to d = 20 and we use the optimal 
r.. Horizontally: 9, vertically: the criterion MSC. Continuous line: Q2 (PCA+SIR), "-o-" 
line: 4 (PCA+ridge), dashed line: tt 5 (PCA+Tikhonov). 
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Figure 9: Sensibility of GRSIR with respect to the cut-off dimension for Model 1. Hor- 
izontally: d, vertically: the MSC criterion. Continuous line: (PCA+SIR), "-o-" line: 
O4 (PCA+ridge), dashed line: (PCA+Tikhonov). Here, 6 = 2 and the optimal r is 
used. 
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Figure 10: Functional relationship between reduced spectra f3 X on the first GRSIR di- 
rection and Y, the grain size of CO"2- Horizontally: reduced spectra from the learning 
database on the first GRSIR direction. Vertically: Grain size of C02- 
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Figure 12: Grain size of CO2 ice estimated by GRSIR (PCA+ridge) on an hyperspectral 
image observed on Mars during orbit 61. 
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